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Abstract 

A simple and computationally efficient scheme for tree-structured vector quantization is presented. Unlike pre- 
vious methods, its quantization error depends only on the intrinsic dimension of the data distribution, rather than the 
apparent dimension of the space in which the data happen to lie. 



1 Introduction 

For a distribution P on R^, the kth quantization error is commonly defined as 



inf 



ijiin 11^ - 

l<j<k 



where || • || denotes Euclidean norm and the expectation is over X drawn at random from P. It is known ifTSi that this 
infimum is realized, though perhaps not uniquely, by some set of points /ii , . . . , /i/j, called a k-optimal set of centers. 
The resulting quantization error has been shown to be roughly fc^^/^ under a variety of different assumptions on P 
ijS). This is discouraging when D is high. For instance, if D = 1000, it means that to merely halve the error, you need 
2500 tiujgs as many codewords! In short, vector quantization is susceptible to the same curse of dimensionality that 
has been the bane of other nonparametric statistical methods. 

A recent positive development in statistics and machine learning has been the realization that a lot of data that 
superficially lie in a high-dimensional space M^, actually have low intrinsic dimension, in the sense of lying close to 
a manifold of dimension d D. We will give several examples of this below. There has thus been a huge interest 
in algorithms that learn this manifold from data, with the intention that future data can then be transformed into this 
low-dimensional space, in which the usual nonparametric (and other) methods will work well |[T8l [T6l |2l . 

In this paper, we are interested in techniques that automatically adapt to intrinsic low dimensional structure without 
having to explicitly learn this structure. We describe a tree-structured vector quantizer whose quantization error is 
^-i/o(d). jjjaj jg gay^ error rate depends only on the low intrinsic dimension rather than the high apparent 
dimension. The quantizer is based on a hierarchical decomposition of M^: first the entire space is split into two 
pieces, then each of these pieces is further split in two, and so on, until a partition of k cells is reached. Each codeword 
is the mean of the distribution restricted to one of these cells. 

Tree-structured vector quantizers abound; the power of our approach comes from the particular splitting method. 
To divide a region S into two, we pick a random direction from the surface of the unit sphere in R^, and split S at the 
median of its projection onto this direction (Figure[T]l. We call the resulting spatial partition a random projection tree 
or RP tree. 

At first glance, it might seem that a better way to split a region is to find the 2-optimal set of centers for it. However, 
as we explain below, this is an NP-hard optimization problem, and is therefore unlikely to be computationally tractable. 
Although there are several algorithms that attempt to solve this problem, such as Lloyd's method lfT2l[m . they are not 
in general able to find the optimal solution. In fact, they are often far from optimal. 
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Figure 1: Spatial partitioning of induced by an RP tree with three levels. The dots are data points; each circle 
represents the mean of the vectors in one cell. 

For our random projection trees, we show that if the data have intrinsic dimension d (in a sense we make precise 
below), then each split pares off about a l/d fraction of the quantization error. Thus, after log k levels of splitting, 
there are k cells and the quantization error is of the form (1 — l/d)^"^^ = fc"^/'^^''). There is no dependence at all on 
the extrinsic dimensionality D. 

2 Detailed overview 

2.1 Low-dimensional manifolds 

The increasing ubiquity of massive, high-dimensional data sets has focused the attention of the statistics and machine 
learning communities on the curse of dimensionality. A large part of this effort is based on exploiting the observation 
that many high-dimensional data sets have low intrinsic dimension. This is a loosely defined notion, which is typically 
used to mean that the data lie near a smooth low-dimensional manifold. 

For instance, suppose that you wish to create realistic animations by collecting human motion data and then fitting 
models to it. A common method for collecting motion data is to have a person wear a skin-tight suit with high contrast 
reference points printed on it. Video cameras are used to track the 3D trajectories of the reference points as the person 
is walking or running. In order to ensure good coverage, a typical suit has about N — 100 reference points. The 
position and posture of the body at a particular point of time is represented by a (3-/V)-dimensional vector. However, 
despite this seeming high dimensionality, the number of degrees of freedom is small, corresponding to the dozen-or-so 
joint angles in the body. The positions of the reference points are more or less deterministic functions of these joint 
angles. 

Interestingly, in this example the intrinsic dimension becomes even smaller if we double the dimension of the 
embedding space by including for each sensor its relative velocity vector In this space of dimension 6A^ the measured 
points will lie very close to the one dimensional manifold describing the combinations of locations and speeds that the 
limbs go through during walking or running. 

To take another example, a speech signal is commonly represented by a high-dimensional time series: the signal is 
broken into overlapping windows, and a variety of filters are applied within each window. Even richer representations 
can be obtained by using more filters, or by concatenating vectors corresponding to consecutive windows. Through 
all this, the intrinsic dimensionality remains small, because the system can be described by a few physical parameters 
describing the configuration of the speaker's vocal apparatus. 

In machine learning and statistics, almost all the work on exploiting intrinsic low dimensionality consists of algo- 
rithms for learning the structure of these manifolds; or more precisely, for learning embeddings of these manifolds into 
low-dimensional Euclidean space. Our contribution is a simple and compact data structure that automatically exploits 
the low intrinsic dimensionality of data on a local level without having to explicitly learn the global manifold structure. 
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Figure 2: Hilbert's space filling curve. Large neighborhoods look 2-dimensional, smaller neighborhoods look 1- 
dimensional, and even smaller neighborhoods would consist mostly of measurement noise and would therefore again 
be 2-dimensional. 

2.2 Defining intrinsic dimensionality 

Low-dimensional manifolds are our inspiration and source of intuition, but when it comes to precisely defining intrinsic 
dimension for data analysis, the differential geometry concept of manifold is not entirely suitable. First of all, any data 
set lies on a one-dimensional manifold, as evidenced by the construction of space-filling curves. Therefore, some 
bound on curvature is implicitly needed. Second, and more important, it is unreasonable to expect data to lie exactly 
on a low-dimensional manifold. At a certain small resolution, measurement error and noise make any data set full- 
dimensional. The most we can hope is that the data distribution is concentrated near a low-dimensional manifold of 
bounded curvature (Figure |2l). 

We address these various concerns with a statistically-motivated notion of dimension: we say that a set S has local 
covariance dimension {d, e, r) if neighborhoods of radius r have (1 — e) fraction of their variance concentrated in a 
d-dimensional subspace. To make this precise, start by letting al,a2, ■ ■ ■ , ult) denote the eigenvalues of the covariance 
matrix; these are the variances in each of the eigenvector directions. 

Definition 1 Set S C K.^ has local covariance dimension (d, e, r) if its restriction to any ball of radius r has covari- 
ance matrix whose largest d eigenvalues satisfy 

2.3 Random projection trees 

Our new data structure, the random projection tree, is built by recursive binary splits. The core tree-building algo- 
rithm is called MakeTree, which takes as input a data set S C M^, and repeatedly calls a splitting subroutine 
ChooseRule. 

procedure MakeTree(S') 
if IS"! < MinSize tlien return {Leaf) 
Ride ^ ChooseRule(5) 

LeftTree ^ MakeTree({x e S : Rule{x) = true}) 
RightTree ^ MAKETREE({a; e S : Rule{x) = false}) 
return {[Rule, LeftTree, RightTree]) 



The RP tree has two types of split. Typically, a direction is chosen uniformly at random from surface of the unit 
sphere and the cell is split at its median, by a hyperplane orthogonal to this direction. Occasionally, a different type of 
split is used, in which a cell is split into two pieces based on distance from the mean. 
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procedure ChooseRule(S') 
ifA2(^)<c.Ai(5) 
then /'''^°°^^ ^ random unit direction i; 

yRule{x) X ■ V < median({z ■ v : z £ S}) 
j Rule{x) := 

\||a;-mean(S')|| < median({i|z - mean(S')i| : z £ S}) 
return {Rule) 

In the code, c is a constant, A(S') is the diameter of S (the distance between the two furthest points in the set), and 
A^(S') is the average diameter, that is, the average distance between points of S: 



2.4 Main result 

Suppose an RP tree is built from a data set S C M^, not necessarily finite. If the tree has k levels, then it partitions 
the space into 2^ cells. We define the radius of a cell C C to be the smallest r > such that 5 n C C B{x, r) for 
some X E C. 

Recall that an RP tree has two different types of split; let's call them splits by distance and splits by projection. 

Theorem 2 There are constants < ci , C2 , C3 < 1 with the following property. Suppose an RP tree is built using data 
set S C MP . Consider any cell C of radius r, such that S D C has local covariance dimension {d, e, r), where e < Ci. 
Pick a point x S H C at random, and let C be the cell that contains it at the next level down. 

• IfC is split by distance, E [A(S' n C")] < C2A(5 n C). 

• IfC is split by projection, then E [A^(5 n C')] < (1 - (cs/d)) A^(S' n C). 

In both cases, the expectation is over the randomization in splitting C and the choice of x Cz S H C. 

2.5 The hardness of finding optimal centers 

Given a data set, the optimization problem of finding a fc-optimal set of centers is called the fc-means problem. Here 
is the formal definition. 

fc-MEANS CLUSTERING 

Input: Set of points xi, . . . ,Xn G M.^; integer k. 

Output: A partition of the points into clusters Ci , . . . , Ck, along with a center p,j for each cluster, so as to 
minimize 

x:eii-^-/^^-ii'- 

j=l tGCj 

The typical method of approaching this task is to apply Lloyd's algorithm IfTSlfTTI . and usually this algorithm is itself 
called /c-means. The distinction between the two is particularly important to make because Lloyd's algorithm is a 
heuristic that often returns a suboptimal solution to the fc-means problem. Indeed, its solution is often very far from 
optimal. 

What's worse, this suboptimality is not just a problem with Lloyd's algorithm, but an inherent difficulty in the 
optimization task. fc-MEANS CLUSTERING is an NP-hard optimization problem, which means that it is very unlikely 
that there exists an efficient algorithm for it. To explain this a bit more clearly, we delve briefly into the theory of 
computational complexity. 

The running time of an algorithm is typically measured as a function of its input/output size. In the case of fc- 
means, for instance, it would be given as a function of n, fc, and D. An efficient algorithm is one whose running time 
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scales polynomially with the problem size. For instance, there are algorithms for sorting n numbers which take time 
proportional to n log n; these qualify as efficient because n log n is bounded above by a polynomial in n. 

For some optimization problems, the best algorithms we know take time exponential in problem size. The famous 
traveling salesman problem (given distances between n cities, plan a circular route through them so that each city is 
visited once and the overall tour length is minimized) is one of these. There are various algorithms for it that take 
time proportional to 2" (or worse): this means each additional city causes the running time to be doubled! Even small 
graphs are therefore hard to solve. 

This disturbing lack of an efficient algorithm is not limited to just a few pathological optimization tasks. Rather, it 
is an epidemic across the entire spectrum of computational tasks, one that afflicts thousands of the problems we most 
urgently want to solve. Amazingly, it has been shown that the fates of these diverse problems (called NP-complete 
problems) are linked: either all of them admit efficient algorithms, or none of them do! The mathematical community 
strongly believes the latter to be the case, although it is has not been proved. Resolving this question is one of the 
seven "grand challenges" of the new millenium identified by the Clay Institute. 

In Appendix II, we show the following. 

Theorem 3 fc-MEANS clustering is an NP-hard optimization problem, even ifk is restricted to 2. 

Thus we cannot expect to be able to find a fc-optimal set of centers; the best we can hope is to find some set of centers 
that achieves roughly the optimal quantization error. 

2.6 Related work 
Quantization 

The literature on vector quantization is substantial; see the wonderful survey of Gray and Neuhoff ||9l for a com- 
prehensive overview. In the most basic setup, there is some distribution P over R"^ from which random vectors are 
generated and observed, and the goal is to pick a finite codebook C C and an encoding function a : M.^ C 
suchthatx a{x) for typical vectors cc. The quantization error is usually measured by squared loss, E||X — a(X)||^. 
An obvious choice is to let a{x) be the nearest neighbor of x in C. However, the number of codewords is often so 
enormous that this nearest neighbor computation cannot be performed in real time. A more efficient scheme is to have 
the codewords arranged in a tree ||4l . 

The asymptotic behavior of quantization error, assuming optimal quantizers and under various conditions on P, 
has been studied in great detail. A nice overview is presented in the recent monograph of Graf and Luschgy |8J. The 
rates obtained for fc-optimal quantizers are generally of the form fc^^/^. There is also work on the special case of data 
that lie exactly on a manifold, and whose distribution is within some constant factor of uniform; in such cases, rates 
of the form fc^^/'' are obtained, where d is the dimension of the manifold. Our setting is considerably more general 
than this: we do not assume optimal quantization (which is NP-hard), we have a broad notion of intrinsic dimension 
that allows points to merely be close to a manifold rather than on it, and we make no other assumptions about the 
distribution P. 

Compressed sensing 

The field of compressed sensing has grown out of the surprising realization that high-dimensional sparse data can be 
accurately reconstructed from just a few random projections |[3]|5|. The central premise of this research area is that 
the original data thus never even needs to be collected: all one ever sees are the random projections. 

RP trees are similar in spirit and entirely compatible with this viewpoint. Theorem |2] holds even if the random 
projections are forced to be the same across each entire level of the tree. For a tree of depth k, this means only fc 
random projections are ever needed, and these can be computed beforehand (the split-by-distance can be reworked to 
operate in the projected space rather than the high-dimensional space). The data are not accessed in any other way. 
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3 An RP tree adapts to intrinsic dimension 



An RP tree has two varieties of split. If a cell C has much larger diameter than average-diameter (average interpoint 
distance), then it is split according to the distances of points from the mean. Otherwise, a random projection is used. 
The first type of split is particularly easy to analyze. 

3.1 Splitting by distance from the mean 

This option is invoked when the points in the current cell, call them S, satisfy A^(S') > cA^(5); recall that A(5) is 
the diameter of S while A^(5) is the average interpoint distance. 

Lemma 4 Suppose that A^(S') > cA^(S'). Let Si denote the points in S whose distance to mean(S') is less than or 
equal to the median distance, and let S2 be the remaining points. Then the expected squared diameter after the split is 

The proof of this lemma is deferred to the Appendix, as are most of the other proofs in this paper. 

3.2 Splitting by projection: proof outline 

Suppose the current cell contains a set of points S C K.^ for which A^(5) < cA^(S'). We will show that a split by 
projection has a constant probability of reducing the average squared diameter A^(S') by 0(A^(S')/(i). Our proof 
has three parts: 

I. Suppose S is split into 5*1 and 5*2, with means fii and /i2. Then the reduction in average diameter can be 
expressed in a remarkably simple form, as a multiple of ||/ii — /^2|P- 

II. Next, we give a lower bound on the distance between the projected means, {pi — ^(2)^- We show that the 
distribution of the projected points is subgaussian with variance 0{/^\{S) / D). This well-behavedness implies 

\h2A{'iii^'fi2? = n{/^\{s)/D). 

III. We finish by showing that, approximately, 11 /ii —/i2 11^ > [D / d){'jli — 'jl2Y ■ This is because /^i — /^2 lies close to 
the subspace spanned by the top d eigenvectors of the covariance matrix of S; and with high probability, every 
vector in this subspace shrinks by 0{^J d/ D) when projected on a random line. 

We now tackle these three parts of the proof in order 

3.3 Quantifying the reduction in average diameter 

The average squared diameter A^(S') has certain reformulations that make it convenient to work with. These prop- 
erties are consequences of the following two observations, the first of which the reader may recognize as a standard 
"bias-variance" decomposition of statistics. 

Lemma 5 Let X, Y be independent and identically distributed random variables in M", and let z G R" be any fixed 
vector 

(a) E [\\X - zf] = E [\\X - EX||2] + \\z - EX||2. 

(b) E [\\X - = 2E [\\X - EX||2]. 

Proof. Part (a) is immediate when both sides are expanded. For (b), we use part (a) to assert that for any fixed y, we 
have E [\\X - yf] = E [\\X - ¥.Xf] + \\y - EXjp. We then take expectation over Y ^y.l 

This can be used to show that the averaged squared diameter, A\{S), is twice the average squared distance of points 
in S from their mean. 
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Corollary 6 The average squared diameter of a set S can also be written as: 

Ai(5) = A^||^_mean(5)|p. 

Proof. A^(S') is simply E — , when X, Y are i.i.d. draws from the uniform distribution over S. I 

At each successive level of the tree, the current cell is split into two, either by a random projection or according to 
distance from the mean. Suppose the points in the current cell are S, and that they are split into sets Si and 5*2. It is 
obvious that the expected diameter is nonincreasing: 

AiS) > ^A(5i) + ^A(52). 

This is also true of the expected average diameter. In fact, we can precisely characterize how much it decreases on 
account of the split. 

Lemma 7 Suppose set S is partitioned (in any manner) into Si and S2- Then 

,2,c^ /l^llA2,c^ , \S2\^2.nA 2|5i|.|52| 



KiS) - i ^AK^i) + ^Aji(52)| = ' |^j2 l|mean(5i) - mean(52) 



This completes part I of the proof outline. 



3.4 Properties of random projections 

Our quantization scheme depends heavily upon certain regularity properties of random projections. We now review 
these properties, which are critical for parts II and III of our proof. 

The most obvious way to pick a random projection from to M. is to choose a projection direction u uniformly 
at random from the surface of the unit sphere S^^^, and to send x i-^ u ■ x. 

Another common option is to select the projection vector from a multivariate Gaussian distribution, u ~ A^(0, [1/ D)Io). 
This gives almost the same distribution as before, and is slightly easier to work with in terms of the algorithm and 
analysis. We will therefore use this type of projection, bearing in mind that all proofs carry over to the other variety as 
well, with slight changes in constants. 

The key property of a random projection from MP to R is that it approximately preserves the lengths of vectors, 
modulo a scaling factor of ^/lJ. This is summarized in the lemma below. 



Lemma 8 Fix any x £ MP. Pick a random vector U ^ -/V(0, (1/ D)Io)- Then for any a, (3 > 0: 



(a) P 

(b) P 



\U -xl <a- 



\U-x\>(3-^ 



- 0^ 



Lemma [8] applies to any individual vector Thus it also applies, in expectation, to a vector chosen at random from 
a set 5 C M^. Applying Markov's inequality, we can then conclude that when S is projected onto a random direction, 
most of the projected points will be close together, in a central interval of size 0{A{S) / VD). 



Lemma 9 Suppose S C lies within some ball B{xq , A). Pick any < S,e < 1 such that Se < 1 /e^. Let v be any 
measure on S. Then with probability > 1 — (5 over the choice of random projection U onto M, all but an e fraction of 

^ofU-xo. 



U ■ S (measured according to v) lies within distance 
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As a corollary, the median of the projected points must also lie within this central interval. 

Corollary 10 Under the hypotheses of Lemma^ for any < S < the following holds with probability at least 

1 — 5 over the choice of projection: 



A / 2 

|median(C/ • S") - C/ • xol < — =--y'21n-. 



Proof. Let i' be the uniform distribution over S and use e = 1/2. 1 

Finally, we examine what happens when the set S* is a rf-dimensional subspace of M^. Lemma [8] tells us that the 
projection of any specific vector x £ S is unlikely to have length too much greater than | j a; j | / VD, with high probability. 
A slightly weaker bound can be shown to hold for all of S simultaneously; the proof technique has appeared before in 
several contexts, including lfT4l [n. 

Lemma 11 There exists a constant ni with the following property. Fix any S > and any d-dimensional subspace 
H C K^. Pick a random projection U ^ N{0, {1/ D)Id). Then with probability at least 1 — S over the choice ofU, 



\x-UV ^ ,. d + In 1/(5 



Proof. It is enough to show that the inequality holds for S — H (1 (surface of the unit sphere in M ). Let N be 
any (l/2)-cover of this set; it is possible to achieve |iV| < 10** ||T3| . Apply Lemma[8] along with a union bound, to 
conclude that with probability at least 1 — S over the choice of projection U, 

, rr,2 „ InlTVl +lnl/,5 
xeN D 

Now, define C by 

C = sup (\x ■ UV ■ , , . 

,/s\ ln|iV|+lnl/5y 

We'll complete the proof by showing C < 8. To this end, pick the x* E S for which the supremum is realized (note 
S is compact), and choose y £ N whose distance to x* is at most 1/2. Then, 

\x*-U\ < \y -U\ + \{x* ~y) -U\ 
^ ^ln\N\ + In 1/5 

From the definition of x\ it follows that VC < V2 + VC /2 and thus C < 8. I 

3.5 Properties of the projected data 

Projection from into shrinks the average squared diameter of a data set by roughly D. To see this, we start with 
the fact that when a data set with covariance A is projected onto a vector U, the projected data have variance U^AU. 
We now show that for random U, such quadratic forms are concentrated about their expected values. 

Lemma 12 Suppose A is an n x n positive semidefinite matrix, and U ^ N(0, {l/n)In). Then for any a, f3 > 0: 

(a) fp'^AU < a ■ Ep'^AU]] < e-((i/2)~«)/2^ and 

(b) ¥[U^AU > (3 ■ Ep'^AU]] < e-(/3-2)/4 
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Lemma 13 Pick U ^ N{Q, {\/ D)In). Then for any S C M^, with probability at least 1 / 10, the projection of S onto 
U has average squared diameter 



Ai(5 • U) > 



AD 



Proof. By Corollary|6l 

Ai(5.f/) = ^^((.T-mean(5)).C/)2 = 2C/^cov(5)C/. 

where cov(S') is the covariance of data set 5*. This quadratic term has expectation (over choice of U) 

E[2U^co\{S)U] = 2^E[C/iC/j]cov(S')y 

i 

Lemma [T2l" a) then bounds the probability that it is much smaller than its expected value. I 

Next, we examine the overall distribution of the projected points. When S C has diameter A, its projection 
into the line can have diameter upto A, but as we saw in Lemma|9] most of it will lie within a central interval of size 
0{A/\/D). What can be said about points that fall outside this interval? 

Lemma 14 Suppose 5 C 5(0, A) C R^. Pick any S > and choose U ^ N{0, {1/D)Id)- Then with probability at 
least 1 — 5 over the choice ofU, the projection S - U = {x - U : x Cz S} satisfies the following property for all positive 
integers k. 

The fraction of points outside the interval ( ^"T^i ^"Tc j tnost — ■ e~ ' . 



Proof. This follows by applying Lemma|9]for each positive integer k (with corresponding failure probability 5/2^), 
and then taking a union bound. I 



3.6 Distance between the projected means 

We are dealing with the case when A'^(S') < c • A^(5'), that is, the diameter of set S is at most a constant factor times 
the average interpoint distance. If S is projected onto a random direction, the projected points will have variance about 
A\{S) /D, by Lemma[T3l and by LemmafT4l it isn't too far from the truth to think of these points as having roughly a 
Gaussian distribution. Thus, if the projected points are split into two groups at the mean, we would expect the means 
of these two groups to be separated by a distance of about A^ {S) /^/15. Indeed, this is the case. The same holds if we 
split at the median, which isn't all that different from the mean for close-to-Gaussian distributions. 

Lemma 15 There is a constant K2for which the following holds. Pick any < S < l/16c. Pick U ~ N(0, {1/D)Id) 
and split S into two pieces: 

Si — {x ^ S : X ■ U < s} and S2 = {x ^ S : x ■ U > s} , 

where s is either mean(S' • U) or median(S ■ U). Write p ~ |S'i|/|S'|, and let Jli and JI2 denote the means of Si ■ U 
and 5*2 • U, respectively. Then with probability at least 1/10 — 5, 



(M2 - IJ-lf > K2 



1 A^iS) 1 



(p(l-p))2 D clogil/Sy 
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Proof. Let the random variable X denote a uniform-random draw from the projected points S ■ U . Without loss of 
generality mean(5) — 0, so that EX = and thus p/ii + (1 —p)pt2 — 0. Rearranging, we get /ii — — (1 — p)(pt2 ^ Mi) 
and pt2 = p(M2 - Ml)- 

We already know from Lemma[T3](and Corollary|6|l that with probability at least 1/10, the variance of the projected 
points is significant: var(X) > A\{S)/8D. We'll show this implies a similar lower bound on (/t2 — Mi)^- 
Using l(-) to denote — 1 indicator variables, 

var(X) < E[{X - sf] 

< ]K[2t\X -s\ + i\X - s| - tf ■ 1{\X -s\> t)] 

for any t > 0. This is a convenient formulation since the linear term gives us /i2 ~ Mi- 

E[2t\X-s\] = 2t{p{s - Jli) + (l - p){li2 ~ s)) 

= 4<-p(l -p) • (/la -Ml) + 2ts(2p- 1). 

The last term vanishes since the split is either at the mean of the projected points, in which case s = 0, or at the 
median, in which case p = 1/2. 
Next, we'll choose 

A{S) f T 



for some suitable constant to, so that the quadratic term in var(X) can be bounded using Lemma[T4land CorollarvfTOl 
with probability at least 1 — 6, 



^2 / . ^HS) 



E[i\X\-tr-l{\X\>t)] < 6- 



D 

(this is a simple integration). Putting the pieces together, we have 

< var(X) < 4i.Ml-p)-(M2-Mi)+'5-^. 
The result now follows immediately by algebraic manipulation, using the relation (S*) < cA\ {S).f 



3.7 Distance between the high-dimensional means 

SpUt S into two pieces as in the setting of LemmafTSl and let /ii and fi2 denote the means of and S2, respectively. 
We already have a lower bound on the distance between the projected means, /i2 — Mi^ we will now show that |l/i2 — Mill 
is larger than this by a factor of about ^ Djd. The main technical difficulty here is the dependence between the [ii 
and the projection I] . Incidentally, this is the only part of the entire argument that exploits intrinsic dimensionality. 

Lemma 16 There exists a constant K3 with the following property. Suppose set S C MP is such that the top d 
eigenvalues of co'v{S) account for more than 1 — e of its trace. Pick a random vector U ~ -/V(0, (1/ D)l£)), and split 
S into two pieces. Si and S2, in any fashion (which may depend upon U). Let p = |5'i|/|5'|. Let fii and M2 the 
means of Si and S2, and let Mi and M2 be the means of Si ■ U and S2 ■ U. 
Then, for any 6 > 0, with probability at least 1 — 5 over the choice of U, 

„2 ^ ^^30 f ^ ,2 4 eA\{S)\ 

ii^^-^^ii ^ m^r^-^^) - p(i-p) 5D )■ 



Proof. Assume without loss of generality that S has zero mean. Let H denote the subspace spanned by the top 
d eigenvectors of the covariance matrix of S, and let be its orthogonal subspace. Write any point x G as 
xh + x^, where each component is seen as a vector in MP that lies in the respective subspace. 
Pick the random vector U; with probability > 1 — 5 it satisfies the following two properties. 
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Property 1 : For some constant n' > 0, for every x E M.^ 

This holds (with probability 1 — 5/2) by LemmafTTI 

Property 2: Letting X denote a uniform-random draw from S, we have 

Ex [{X^ ■ C/)2] < ^ . E,Mx [iXi_ ■ U f] 



= yExEu[iX±-Uf] 

= ^-Exiwx.r] < 

6D ^" " ^ - 6D 

The first step is Markov's inequality, and holds with probability 1 — S/2. The last inequality comes from the local 
covariance condition. 

So assume the two properties hold. Writing fj,2 — Mi {h2H — Miff) + (P2± — Mi±), 

< 2{{fi2H - /im) • Uf + 2{{fi2± - Mi±) • U)\ 
The first term can be bounded by Property 1 : 

{{.IJ.2H - I^Ih) ■ Uy < 11^2 -Ml 



2 / II ii2 / d + lnl/5 



D 

For the second term, let Ex denote expectation over X chosen uniformly at random from S. Then 

((M2± - Mi±) • C^)' < 2(Ai2± •[/)' + 2(mi± • C/)' 

= 2{Ex[X^_ -UIX e 82]^ + 2{Ex[Xi_ • [/ | X G Si])^ 

< 2Ex [{X^ -UyiXe S2] + 2Ex [iX^ -UflXeSi] 

< -^.Ex[{X^-Uy] + --Ex[{X^-Uf] 
1-p p 

by Property 2. The lemma follows by putting the various pieces together I 
We can now finish off the proof of Theorem|2] 

Theorem 17 Fix any e < 0(1/ c). Suppose set S C has the property that the top d eigenvalues o/cov(5) account 
for more than 1 — e of its trace. Pick a random vector U ^ N{0, (l/D)/^)) and split S into two parts. 

Si = {x ^ S : X ■ U < s} and S2 = {x ^ S : x ■ U > s} , 

where s is either mean(S' • U) or median{S ■ U). Then with probability ri(l), the expected average diameter shrinks 
byn{A\{S)/cd). 

Proof. By Lemma|7] the reduction in expected average diameter is 

^i(^)-(S^A(^i) + W^A(52)| = ^'^^':f^' ||mean(5i)-mean(52)f, 



, \S\ " \S\ \S\^ 
or 2p(l — — /i2|P in the language of Lemmas [T5] and [T6l The rest follows from those two lemmas. I 
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4 Appendix I: Proofs of main theorem 



4.1 Proof of Lemma 1 

Since U has a Gaussian distribution, and any linear combination of independent Gaussians is a Gaussian, it follows 
that the projection U ■ x is also Gaussian. Its mean and variance are easily seen to be zero and respectively. 
Therefore, writing 



we have that Z ^ N{0, 1). The bounds stated in the lemma now follow from properties of the standard normal. In 
particular, A^(0, 1) is roughly flat in the range [—1,1] and then drops off rapidly; the two cases in the lemma statement 
correspond to these two regimes. 

The highest density achieved by the standard normal is 1 /\/27r. Thus the probabihty mass it assigns to the interval 
[—a, a] is at most laj^ph^; this takes care of (a). For (b), we use a standard tail bound for the normal, P(|^| > f3) < 
{2/(3)e~^ see, for instance, page 7 of |7|. 

4.2 Proof of Lemma |9] 



Set c = v/21nl/((5e) > 2. 

Fix any point x, and randomly choose a projection U. Let x = U ■ x (and likewise, let S = U ■ S). What is the 
chance that x lands far from xq? Define the bad event to be F^; = l(|x — xqI > cA/\/D). By Lemma[8tb), we have 



i~ ~ I ^ 2;o|| 



D 



< < Se. 



Since this holds for any x G S, it also holds in expectation over x drawn from We are interested in bounding the 
probability (over the choice of U) that more than an e fraction of ly falls far from xq. Using Markov's inequaUty and 
then Fubini's theorem, we have 

Vu[Ef,[F^] >e] < ^ " = -'^ < 6, 

as claimed. 

4.3 Proof of Lemma |4] 

Let random variable X be distributed uniformly over S. Then 



P[|lX-EXf > median(||X-EX||2)] 



1 

> - 
~ 2 



by definition of median, so E [\\X - EX\\^] > median(||X - EXf)/2. It follows from Corollary that 

median(||X - EX\\^) < 2E [\\X - EX\\^] = A\{S). 

Set 5*1 has squared diameter A^(5i) < (2median(||X-EX||))^ < 4A^(S'). Meanwhile, 52 has squared diameter 
at most A^(S'). Therefore, 

and the lemma follows by using A^(S') > cA\{S). 
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4.4 Proof of Lemma [12] 

This follows by examining the moment-generating function of U^AU. Since the distribution of U is spherically 
symmetric, we can work in the eigenbasis of A and assume without loss of generality that A = diag(ai, . . . , a„), 
where ai, . . . , a„ are the eigenvalues. Moreover, for convenience we take J^'^i — 1- 

Let Ui, . . . ,Un denote the individual coordinates of U. We can rewrite them as Ui = Zi / ^/n, where Zi , . . . , Z„ 
are i.i.d. standard normal random variables. Thus 



This tells us immediately that E[C/^Af7] = 

We use Chernoff 's bounding method for both parts. For (a), for any t > 0, 



\U'^AU < a ■ Ep'^AU]] = 



< a 



> e 



< 



— taiZj 



wi— 



1/2 



and the rest follows by using t = 1/2 along with the inequality 1/(1 + x) < e for < a; < 1. Similarly for (b), 
for < t < 1 /2, 



[U^AU > P ■ E[U^AU]] = 



J2a,Zf>(3 



< 



J- J- V 1 - 2ta, 



-tp 



1/2 



and it is adequate to choose t — 1/A and invoke the inequality 1/(1 — x) < e for < x < 1/2. 



4.5 Proof of Lemma U\ 

Let fii, 1^12 denote the means of S, Si, and 5*2. Using Corollary |6] and Lemma|5ja), we have 



\Si 



\S\ 



\S\ 



2 V"^ II ||2 I'S'll 2 2 1^2 



\s\ 



Si 



\S\ \S2 



El 



a; - M2I 



^ |e (II- - MiP - II- - A^iiP) + E (II- - MiP - b - I 

\ Si S2 ) 



2|5l|,| ||2 

Vi - m 



V2 - A^ll • 



\s\ '-^ 1^1 

Writing /^i as a weighted average of jUi and /i2 then completes the proof. 
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5 Appendix II: Hardness of /c -means clustering 

/c-MEANS CLUSTERING 

Input: Set of points . . . , x„ e M''; integer k. 

Output: A partition of the points into clusters Ci , . . . , Cfc, along with a center /ij for each cluster, so as to 
minimize 

k 

3 = 1 leCj 

(Here || • || is Euclidean distance.) It can be checked that in any optimal solution, /ij is the mean of the points in Cj. 
Thus the {/Xj} can be removed entirely from the formulation of the problem. From Lemma|3b), 

Therefore, the fc-means cost function can equivalently be rewritten as 

We consider the specific case when k is fixed to 2. 
Theorem 18 2-means clustering is an NP-hard optimization problem. 

This was recently asserted in ||6|, but the proof was flawed. We establish hardness by a sequence of reductions. Our 
starting point is a standard restriction of 3SAT that is well known to be NP-complete. 

3SAT 

Input: A Boolean formula in 3CNF, where each clause has exactly three literals and each variable appears 
at least twice. 

Output: true if formula is satisfiable, false if not. 

By a standard reduction from 3SAT, we show that a special case of NOT-ALL-EQUAL 3SAT is also hard. For com- 
pleteness, the details are laid out in the next section. 

NaeSat* 

Input: A Boolean formula 0(.ti , . . . , x„) in 3CNF, such that (i) every clause contains exactly three literals, 
and (ii) each pair of variables Xi, Xj appears together in at most two clauses, once as either {xi^Xj} or 
{xi^Xj}, and once as either {xi^Xj} or {xi^Xj}. 

Output: true if there exists an assignment in which each clause contains exactly one or two satisfied 
literals; false otherwise. 

Finally, we get to a generalization of 2-MEANS. 

Generalized 2-means 

Input: An n x n matrix of interpoint distances Dij . 

Output: A partition of the points into two clusters C\ and C2, so as to minimize 
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We reduce NaeSat* to Generalized 2-means. For any input 4> to NaeSat*, we show how to efficiently produce 
a distance matrix D(0) and a threshold c((/)) such that satisfies NaeSat* if and only if D{(t>) admits a generalized 
2-means clustering of cost < c{(f)). 

Thus Generalized 2-means clustering is hard. To get back to 2-means (and thus estabhsh TheoremfTSli, 
we prove that the distance matrix -D(0) can in fact be realized by squared Euclidean distances. This existential fact is 
also constructive, because in such cases, the embedding can be obtained in cubic time by classical multidimensional 
scaling ifTOl . 

5.1 Hardness of NaeSat* 

Given an input 0(xi , . . . , 2:„) to 3SAT, we first construct an intermediate formula 4/ that is satisfiable if and only if 
is, and additionally has exactly three occurrences of each variable: one in a clause of size three, and two in clauses of 
size two. This ((>' is then used to produce an input 0" to NaeSat*. 

1. Constructing 0'. 

Suppose variable Xi appears k > 2 times in (f>. Create k variables xn, . . . , Xik for use in 4>': use the same 
clauses, but replace each occurrence of Xi by one of the Xij . To enforce agreement between the different copies 
Xij, add k additional clauses {xn V Xi2), {xi2 V Xis), . . . , {xik,Xii). These correspond to the imphcations 

Xl X2,X2 X3, . . . ,Xk ^ Xi. 

By design, is satisfiable if and only if 0' is satisfiable. 

2. Constructing cf)". 

Now we construct an input cfi" for NaeSat*. Suppose (j)' has m clauses with three literals and m' clauses with 
two literals. Create 2m + m' + 1 new variables: si, . . . , s™ and /i, . . . , fm+m' and /. 

If the jth three-literal clause in (/)' is (a V/3 V7), replace it with two clauses in (j>": (a V/3 V Sj) and {sj VjVfj). 
If the jth two-literal clause in 0' is {a V /?), replace it with (a V /3 V fm+j) in (f>" ■ Finally, add m + m' clauses 
that enforce agreement among the /,: (/^ V /2 V /), (/a V /s V /),... , {7m+m' V /i V /). 

All clauses in 0" have exactly three literals. Moreover, the only pairs of variables that occur together (in clauses) 
more than once are {fi, /} pairs. Each such pair occurs twice, as {fi, /} and /}. 

Lemma 19 0' is satisfiable if and only if (j)" is not-all-equal satisfiable. 

Proof. First suppose that 4>' is satisfiable. Use the same settings of the variables for cf)" . Set / = /i = • • • = fm+m' = 
false. For the jth three-literal clause (a V /3 V 7) of (/)', if a = /3 = false then set Sj to true, otherwise set sj to 
false. The resulting assignment satisfies exactly one or two literals of each clause in 0". 

Conversely, suppose 0" is not-all-equal satisfiable. Without loss of generality, the satisfying assignment has / set 
to false (otherwise flip all assignments). The clauses of the form (/j V /i+i V /) then enforce agreement among 
all the /, variables. We can assume they are all false (otherwise, once again, flip all assignments). This means the 
two-literal clauses of 0' must be satisfied. Finally, consider any three-literal clause (a V /3 V 7) of 0' . This was replaced 
by (a V /3 V Sj) and {sj V 7 V fj) in 0". Since fj is false, it follows that one of the hterals a, (3, 7 must be satisfied. 
Thus (p' is satisfied. I 

5.2 Hardness of Generalized 2-means 

Given an instance , . . . , a;„) of NaeS AT*, we construct a 2n x 2n distance matrix D = D{4i) where the (implicit) 
2n points correspond to literals. Entries of this matrix will be indexed as Da,/!, for a, /3 E {xi , . . . ,Xn,xi, . . . , x„}. 
Another bit of notation: we write a ~ /3 to mean that either a and /? occur together in a clause or a and (3 occur 
together in a clause. For instance, the clause (x V y V z) allows one to assert x ^ y but not x ^ y. The input 
restrictions on NaeSat* ensure that every relationship a ~ /3 is generated by a unique clause; it is not possible to 
have two different clauses that both contain either {a, f3} or {a, (3}. 
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Define 



Da,p — 



if a = /3 

1 + A \fa=]3 
1 + 5 if a ~ /3 
1 otherwise 



Here < (5 < A < 1 are constants such that 4(5m < A < 1 — 25n, where m is the number of clauses of 0. One valid 
setting is (5 = l/(5m + 2n) and A = bSm. 

Lemma 20 If (j) is a satisfiable instance o/NaeSat*, then D{(j)) admits a generalized 2-means clustering of cost 
c{<j}) = n — 1 + 26m /n, where m is the number of clauses of (j). 

Proof. The obvious clustering is to make one cluster (say Ci) consist of the positive literals in the satisfying not- 
all-equal assignment and the other cluster (C2) the negative literals. Each cluster has n points, and the distance 
between any two distinct points a, (3 within a cluster is either 1 or, if a ~ /?, 1 + 5. Each clause of (f) has at least 
one literal in Ci and at least one literal in C2, since it is a not-all-equal assignment. Hence it contributes exactly one 
~ pair to Ci and one ~ pair to C2. The figure below shows an example with a clause {xWyWz) and assignment 
X = true,?/ = z = false. 



Thus the clustering cost is 
1 

2n 



E 

i,i'eCi 



1 

2n 



E Dr. 
i,i'eC2 



n-1 



n 
2 

2Sr 



m5 



I 

Lemma 21 Let Ci, C2 be any 2-clustering of D((j)). If Ci contains both a variable and its negation, then the cost of 
this clustering is at least n — 1 + A/ {2n) > c(0). 

Proof. Suppose Ci has n' points while C2 has 2n — n' points. Since all distances are at least 1, and since Ci contains 
a pair of points at distance 1 + A, the total clustering cost is at least 



1 



A 



1 f2n - n' 
2n - n' V 2 



A A 

n - IH > n - IH . 

n' - 2n 



Since A > ASm, this is always more than c(0). I 



17 




Lemma 22 If D{(j}) admits a 2-clustenng of cost < c{(j)), then (j) is a satisfiable instance o/NaeSat*. 

Proof. Let Ci , C2 be a 2-clustering of cost < c{(j>). By the previous lemma, neither Ci nor C2 contain both a variable 
and its negation. Thus \Ci \ = IC2 1 = n. The cost of the clustering can be written as 

1 if clause split between Ci, C2 
3 otherwise 

clauses 

Since the cost is < c{(j)), it follows that all clauses are split between Ci and C2, that is, every clause has at least one 
literal in Ci and one literal in C2. Therefore, the assignment that sets all of Ci to true and all of C2 to false is a 
valid NaeSat* assignment for 0. I 

5.3 Embeddability of (0) 

We now show that D{(f)) can be embedded into in the sense that there exist points G K^" such that D^^p = 
\\xa — xpW^ for all a, (3. We rely upon the following classical result [|171 . 

Theorem 23 (Schoenberg) Let H denote the matrix I— (l/iV)ll^. An NxN symmetric matrix D can be embedded 
into I2 if and only if —HDH is positive semidefinite. 

The following corollary is immediate. 

Corollary 24 An NxN symmetric matrix D can be embedded into I2 if and only ifii^Du < Qfor all u E M.^ with 
u-l = 0. 

Proof. Since the range of the map v t-^ Hv is precisely {li e : it • 1 = 0}, we have 

-HDH is positive semidefinite ^ v'^HDHv < for all v e 

^ u^Du < for all it e with u • 1 = 0. 

I 

Lemma 25 D{(f>) can be embedded into 

Proof. If is a formula with variables xi, . . . , Xn, then D ~ D{4>) is a2n x 2n matrix whose first rows/columns 
correspond to xi , . . . , a;„ and remaining rows/columns correspond to , . . . , x„. The entry for literals (a, (3) is 

Dc^p = 1 - l(a = /3) + A • l(a = /3) + (5 • l(a - (3), 

where l(-) denotes the indicator function. 

Now, pick any u E M^" with u • 1 = 0. Let it+ denote the first n coordinates of u and the last n coordinates. 

= Uo^up (1 - l(a = /3) + A • l(a = ;5) + (5 • l(a ~ /?)) 

a,/3 OL a a,/3 



< 



< -{1- A)\\uf + 2S\\ufn 
where the last step uses the Cauchy-Schwarz inequality. Since 2Sn < 1 — A, this quantity is always < 0. 1 
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